Entropy, KL distance, and Deviance

Jesse Brunner

From whence our metric of information?

  • entropy as a measure of information
  • KL as a measure of distance
    • added uncertainty by using an approximation for the True distribution
  • Deviance as a metric of (relative) distance
    • do not need to know what is True

A very simple example

# True model
a <- 2
b <- 1.5
sigma <- 2

x <- c(1,5,7,10)
mu <- a+b*x

# observations
y <- round(rnorm(length(mu), 
                 mean=mu, 
                 sd=sigma), 
           1)

Entropy of data | True model

\[ H(p) = -\mathbb{E}\left[ \log(p_i)\right] = -\sum_{i=1}^n p_i \log(p_i) \]

# calculate entropy of data | True model
(ps <- dnorm(y, 
             mean=mu, 
             sd=sigma)
 )
[1] 0.1972397 0.0219918 0.1209854 0.1841351
-sum(ps*log(ps))
[1] 0.9712345
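
As a quick sanity check on the formula, consider a toy example (not part of the model above): the entropy of a fair coin is \(\log 2\) nats.

```r
# entropy of a fair coin: two outcomes, each with probability 1/2
p <- c(0.5, 0.5)
-sum(p * log(p))  # = log(2), about 0.693 nats
```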

Let’s fit two simple models

library(rethinking)  # provides quap(), precis(), extract.samples(), lppd()

# fit model with just a mean
m0 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(5, 3),
    sigma ~ dexp(1)
  ), data=data.frame(x,y)
)
precis(m0)
           mean        sd     5.5%     94.5%
mu    10.061225 1.7454374 7.271679 12.850771
sigma  3.761666 0.9382176 2.262213  5.261119
# fit model with a mean and slope
m1 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(5, 3),
    b ~ dnorm(0,1),
    sigma ~ dexp(1)
  ), data=data.frame(x,y)
)
precis(m1)
          mean        sd      5.5%    94.5%
a     4.681483 1.4027937 2.4395479 6.923418
b     1.285315 0.2185142 0.9360866 1.634542
sigma 1.576934 0.4491628 0.8590849 2.294783


Cross entropy from using m0 to approximate Truth

\[ H(p, q) = -\sum_{i=1}^n p_i \log(q_i) \]

## posterior-mean prediction from m0 (the same for every observation)
preds_m0 <- rep(mean(extract.samples(m0)$mu), length(y))

## cross entropy
(qs <- dnorm(y, 
             mean=preds_m0,  # probs if we use m0
             sd=mean(extract.samples(m0)$sigma)) )
[1] 0.02639825 0.06654662 0.05294836 0.02802186
-sum(ps*log(qs))
[1] 1.790202
# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8189678

Cross entropy from using m1 to approximate Truth

\[ H(p, q) = -\sum_{i=1}^n p_i \log(q_i) \]

## posterior-mean predictions from m1
post1 <- extract.samples(m1)
preds_m1 <- mean(post1$a) + mean(post1$b)*x

## cross entropy
(rs <- dnorm(y, 
             mean=preds_m1,  # probs if we use m1
             sd=mean(post1$sigma)) )
[1] 0.09555285 0.06663565 0.22113604 0.17771967
-sum(ps*log(rs))
[1] 1.023365
# added entropy by using m1 to approximate True
-sum(ps*log(rs)) - -sum(ps*log(ps))
[1] 0.05213056

Kullback-Leibler divergence

\[ D_{KL}(p,q) = \sum_{i=1}^n p_i\left[ \log(p_i) - \log(q_i) \right] \] measures the added entropy from using a model \(q\) to approximate Truth \(p\)

# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8189678
## Dkl(p,q)
sum(ps*(log(ps)-log(qs)))
[1] 0.8189678
## --> it's the same!
## added entropy by using m1 to approximate true
-sum(ps*log(rs)) - -sum(ps*log(ps))
[1] 0.05213056
## Dkl(p,r)
sum(ps*(log(ps)-log(rs)))
[1] 0.05213056
## --> it's the same!
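
Two properties worth checking with a toy discrete example (values made up for illustration): \(D_{KL}\) is never negative, and it is zero exactly when the approximating distribution matches the true one.

```r
p <- c(0.2, 0.5, 0.3)        # a toy "true" distribution
q <- c(0.4, 0.4, 0.2)        # a toy approximation
sum(p * (log(p) - log(q)))   # positive: approximating adds entropy
sum(p * (log(p) - log(p)))   # zero: no added entropy when q = p
```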

Can compare KL distances of our models

\[ \begin{align} D_{KL}(p,q) - D_{KL}(p,r) & = \sum_{i=1}^n p_i\left[ \log(p_i) - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ \log(p_i) - \log(r_i) \right] \\ & = \sum_{i=1}^n p_i\left[ - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ - \log(r_i) \right] \end{align} \]

# Difference in KL distances between m0 and m1
sum(ps*(log(ps)-log(qs))) - sum(ps*(log(ps)-log(rs)))
[1] 0.7668372
# We can get the same result if 
# we ignore the first log(ps) term in both quantities
-sum(ps*log(qs)) - -sum(ps*log(rs))
[1] 0.7668372

What if we do not know the Truth?

We have almost eliminated the \(p_i\) (ps) from the quantity, but not quite

If we take them out completely, we end up with log-probability scores, which are just unstandardized

\[ \begin{align} D_{KL}(p,q) - D_{KL}(p,r) & = \sum_{i=1}^n p_i\left[ - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ - \log(r_i) \right] \\ & \propto \sum_{i=1}^n -\log(q_i) - \sum_{i=1}^n -\log(r_i) \end{align} \] So we use log-probability scores to describe fit and compare them between models <phew!>
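
A toy sketch of that claim (densities made up for illustration): whether or not we weight by the unknowable \(p_i\), the two comparisons pick the same winner.

```r
ps <- c(0.20, 0.02, 0.12, 0.18)  # densities under Truth (unknowable)
qs <- c(0.03, 0.07, 0.05, 0.03)  # densities under a poor model
rs <- c(0.10, 0.07, 0.22, 0.18)  # densities under a better model
## Truth-weighted comparison (what we cannot compute in practice)
-sum(ps * log(qs)) - -sum(ps * log(rs))  # positive: r beats q
## plain log-probability scores -- no ps required
-sum(log(qs)) - -sum(log(rs))            # also positive: same ordering
```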

But not quite: lppd

We have been pretending we had a single value for our expectations (the MAP). In actuality, we have a full distribution (the posterior) → the log pointwise predictive density

\[ \text{lppd}(y,\Theta)=\sum_{i=1}^n \log \frac{1}{S}\sum_{s=1}^S p(y_i \mid \Theta_s), \]

# m0
sum(log(qs))
[1] -12.85752
sum(lppd(m0))
[1] -12.63249
# m1
sum(log(rs))
[1] -8.293116
sum(lppd(m1))
[1] -8.301116
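
To see what `lppd()` is doing, here is the formula computed by hand on a stand-in setup (the posterior draws and observations below are made up purely for illustration; in the deck you would use `extract.samples(m1)` with the `y` from above).

```r
set.seed(42)
S <- 1000
## stand-in "posterior draws" for a, b, sigma (made up for illustration)
post <- data.frame(a = rnorm(S, 4.7, 1.4),
                   b = rnorm(S, 1.29, 0.22),
                   sigma = rexp(S, 1/1.6))
x <- c(1, 5, 7, 10)
y <- c(3.9, 9.2, 13.1, 16.8)  # toy observations
## S x n matrix of densities p(y_i | Theta_s)
dens <- sapply(seq_along(y), function(i)
  dnorm(y[i], mean = post$a + post$b * x[i], sd = post$sigma))
## average density over draws, then log, then sum over observations
sum(log(colMeans(dens)))
```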

But not quite: we use deviance

-2*sum(lppd(m0))
[1] 25.1819
-2*sum(lppd(m1))
[1] 16.74806

(smaller is better)

These are our metrics of fit! Note that m1 fits our data far better than m0.